13 research outputs found
Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos
Multi-person 3D mesh recovery from videos is a critical first step towards
automatic perception of group behavior in virtual reality, physical therapy and
beyond. However, existing approaches rely on multi-stage paradigms, where the
person detection and tracking stages are performed in a multi-person setting,
while temporal dynamics are only modeled for one person at a time.
Consequently, their performance is severely limited by the lack of inter-person
interactions in the spatial-temporal mesh recovery, as well as by detection and
tracking defects. To address these challenges, we propose the Coordinate
transFormer (CoordFormer) that directly models multi-person spatial-temporal
relations and simultaneously performs multi-mesh recovery in an end-to-end
manner. Instead of partitioning the feature map into coarse-scale patch-wise
tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve
pixel-level spatial-temporal coordinate information. Additionally, we propose a
simple, yet effective Body Center Attention mechanism to fuse position
information. Extensive experiments on the 3DPW dataset demonstrate that
CoordFormer significantly improves the state-of-the-art, outperforming the
previous best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE,
and PVE metrics, respectively, while being 40% faster than recent video-based
approaches. The released code can be found at
https://github.com/Li-Hao-yuan/CoordFormer. Comment: ICCV 2023
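For a concrete picture of what "coordinate-aware" attention can mean, the minimal PyTorch sketch below appends normalized (t, y, x) coordinates to every pixel token before joint space-time attention. It illustrates the general idea only; the module name, tensor shapes, and fusion scheme are assumptions rather than the authors' released implementation.

```python
# Minimal, illustrative sketch (not the released CoordFormer code): append
# normalized (t, y, x) coordinates to every pixel token so that attention can
# exploit pixel-level spatio-temporal positions. Shapes and layers are assumed.
import torch
import torch.nn as nn

class CoordAwareAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.proj_in = nn.Linear(dim + 3, dim)   # fuse features with (t, y, x)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                    # feats: (B, T, C, H, W)
        B, T, C, H, W = feats.shape
        t = torch.linspace(0, 1, T, device=feats.device)
        y = torch.linspace(0, 1, H, device=feats.device)
        x = torch.linspace(0, 1, W, device=feats.device)
        tt, yy, xx = torch.meshgrid(t, y, x, indexing="ij")         # (T, H, W) each
        coords = torch.stack([tt, yy, xx], dim=-1)                  # (T, H, W, 3)
        coords = coords.reshape(1, T * H * W, 3).expand(B, -1, -1)
        tokens = feats.permute(0, 1, 3, 4, 2).reshape(B, T * H * W, C)
        tokens = self.proj_in(torch.cat([tokens, coords], dim=-1))
        out, _ = self.attn(tokens, tokens, tokens)                  # joint space-time attention
        return out.reshape(B, T, H, W, C).permute(0, 1, 4, 2, 3)

# Tiny usage example: 2 frames of an 8-channel 16x16 feature map.
feats = torch.randn(1, 2, 8, 16, 16)
print(CoordAwareAttention(dim=8)(feats).shape)   # torch.Size([1, 2, 8, 16, 16])
```
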
Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN
Source at https://proceedings.neurips.cc/paper/2021/hash/151de84cca69258b17375e2f44239191-Abstract.html.
Image-based virtual try-on is one of the most promising applications of human-centric image generation due to its tremendous real-world potential. Yet, as most try-on approaches fit in-shop garments onto a target person, they require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability. While a few recent works attempt to transfer garments directly from one person to another, alleviating the need to collect paired datasets, their performance is impacted by the lack of paired (supervised) information. In particular, disentangling style and spatial information of the garment becomes a challenge, which existing methods either address by requiring auxiliary data or extensive online optimization procedures, thereby still inhibiting their scalability. To achieve a scalable virtual try-on system that can transfer arbitrary garments between a source and a target person in an unsupervised manner, we thus propose a texture-preserving end-to-end network, the PAtch-routed SpaTially-Adaptive GAN (PASTA-GAN), that facilitates real-world unpaired virtual try-on. Specifically, to disentangle the style and spatial information of each garment, PASTA-GAN consists of an innovative patch-routed disentanglement module for successfully retaining garment texture and shape characteristics. Guided by the source person's keypoints, the patch-routed disentanglement module first decouples garments into normalized patches, thus eliminating the inherent spatial information of the garment, and then reconstructs the normalized patches to the warped garment complying with the target person pose. Given the warped garment, PASTA-GAN further introduces novel spatially-adaptive residual blocks that guide the generator to synthesize more realistic garment details. Extensive comparisons with paired and unpaired approaches demonstrate the superiority of PASTA-GAN, highlighting its ability to generate high-quality try-on images when faced with a large variety of garments (e.g., vests, shirts, pants), taking a crucial step towards real-world scalable try-on.
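To make the "spatially-adaptive residual blocks" more concrete, here is a minimal SPADE-style sketch in which per-pixel scale and shift parameters are predicted from the warped garment. Channel counts, kernel sizes, and the exact conditioning signal are assumptions, not PASTA-GAN's released code.

```python
# Minimal SPADE-style sketch of a spatially-adaptive residual block (an
# illustration of the general idea, not PASTA-GAN's released code): per-pixel
# scale/shift parameters are predicted from the warped garment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveResBlock(nn.Module):
    def __init__(self, channels, cond_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Conv2d(cond_channels, channels, 3, padding=1)
        self.to_beta = nn.Conv2d(cond_channels, channels, 3, padding=1)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, warped_garment):
        # Resize the conditioning image to the feature resolution, then modulate.
        cond = F.interpolate(warped_garment, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)
        h = self.norm(x) * (1 + self.to_gamma(cond)) + self.to_beta(cond)
        return x + self.conv(F.relu(h))          # residual connection

x = torch.randn(1, 64, 32, 32)                   # generator features
garment = torch.randn(1, 3, 128, 128)            # warped garment (RGB)
print(SpatiallyAdaptiveResBlock(64, 3)(x, garment).shape)   # torch.Size([1, 64, 32, 32])
```
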
GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning
Image-based Virtual Try-ON aims to transfer an in-shop garment onto a
specific person. Existing methods employ a global warping module to model the
anisotropic deformation for different garment parts, which fails to preserve
the semantic information of different parts when receiving challenging inputs
(e.g., intricate human poses, difficult garments). Moreover, most of them
directly warp the input garment to align with the boundary of the preserved
region, which usually requires texture squeezing to meet the boundary shape
constraint and thus leads to texture distortion. The above inferior performance
hinders existing methods from real-world applications. To address these
problems and take a step towards real-world virtual try-on, we propose a
General-Purpose Virtual Try-ON framework, named GP-VTON, by developing an
innovative Local-Flow Global-Parsing (LFGP) warping module and a Dynamic
Gradient Truncation (DGT) training strategy. Specifically, compared with the
previous global warping mechanism, LFGP employs local flows to warp garment
parts individually and assembles the locally warped results via global
garment parsing, resulting in reasonable warped parts and a semantically
correct, intact garment even with challenging inputs. On the other hand, our DGT training
strategy dynamically truncates the gradient in the overlap area, so the warped
garment is no longer required to meet the boundary constraint, which effectively
avoids the texture squeezing problem. Furthermore, our GP-VTON can be easily
extended to the multi-category scenario and jointly trained using data from
different garment categories. Extensive experiments on two high-resolution
benchmarks demonstrate our superiority over the existing state-of-the-art
methods. Comment: 8 pages, 8 figures, The IEEE/CVF Computer Vision and Pattern
Recognition Conference (CVPR 2023)
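As a rough illustration of the local-flow/global-parsing idea, the sketch below warps each garment part with its own flow field and blends the warped parts with a softmax over a predicted parsing map. The number of parts, tensor shapes, and how the flows and parsing are predicted are assumptions made only for illustration, not the GP-VTON implementation.

```python
# Illustrative sketch of local-flow warping plus global-parsing assembly: each
# garment part is warped by its own flow field, and a parsing map decides which
# part is visible at each pixel. Shapes and part count are assumptions.
import torch
import torch.nn.functional as F

def warp_with_flow(part, flow):
    """Bilinearly sample one garment part (B, 3, H, W) with a flow field (B, 2, H, W)."""
    B, _, H, W = part.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(part, base_grid + flow.permute(0, 2, 3, 1), align_corners=False)

def assemble_parts(parts, flows, parsing_logits):
    """Warp each part with its local flow, then blend them via the global parsing map."""
    warped = torch.stack([warp_with_flow(p, f) for p, f in zip(parts, flows)], dim=1)  # (B, K, 3, H, W)
    weights = torch.softmax(parsing_logits, dim=1).unsqueeze(2)                        # (B, K, 1, H, W)
    return (weights * warped).sum(dim=1)                                               # (B, 3, H, W)

# Toy example with K=3 parts (e.g. torso, left sleeve, right sleeve).
parts = [torch.randn(1, 3, 64, 48) for _ in range(3)]
flows = [torch.zeros(1, 2, 64, 48) for _ in range(3)]      # zero flow = (near-)identity warp
parsing_logits = torch.randn(1, 3, 64, 48)
print(assemble_parts(parts, flows, parsing_logits).shape)   # torch.Size([1, 3, 64, 48])
```
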
M3D-VTON: A Monocular-to-3D Virtual Try-On Network
Virtual 3D try-on can provide an intuitive and realistic view for online shopping and has huge potential commercial value. However, existing 3D virtual try-on methods mainly rely on annotated 3D human shapes and garment templates, which hinders their applications in practical scenarios. 2D virtual try-on approaches provide a faster alternative to manipulate clothed humans, but lack a rich and realistic 3D representation. In this paper, we propose a novel Monocular-to-3D Virtual Try-On Network (M3D-VTON) that builds on the merits of both 2D and 3D approaches. By integrating 2D information efficiently and learning a mapping that lifts the 2D representation to 3D, we make the first attempt to reconstruct a 3D try-on mesh taking only the target clothing and a person image as inputs. The proposed M3D-VTON includes three modules: 1) The Monocular Prediction Module (MPM) that estimates an initial full-body depth map and accomplishes 2D clothes-person alignment through a novel two-stage warping procedure; 2) The Depth Refinement Module (DRM) that refines the initial body depth to produce more detailed pleat and face characteristics; 3) The Texture Fusion Module (TFM) that fuses the warped clothing with the non-target body part to refine the results. We also construct a high-quality synthesized Monocular-to-3D virtual try-on dataset, in which each person image is associated with a front and a back depth map. Extensive experiments demonstrate that the proposed M3D-VTON can manipulate and reconstruct the 3D human body wearing the given clothing with compelling details and is more efficient than other 3D approaches.
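A toy sketch of the underlying front/back depth representation is given below: a front and a back depth map, back-projected with a simple orthographic camera, yield two point sheets that together bound the clothed body. The camera model and coordinate scaling are assumptions and are not taken from the M3D-VTON code.

```python
# Toy sketch of the "front/back depth" representation (orthographic camera and
# coordinate scaling are assumptions, not taken from the M3D-VTON code): two
# depth maps are back-projected into point sheets that bound the clothed body.
import torch

def depths_to_points(front_depth, back_depth, mask):
    """front_depth, back_depth, mask: (H, W) tensors; returns an (N, 3) point cloud."""
    H, W = front_depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.float() / W - 0.5                    # normalize image coords to roughly [-0.5, 0.5]
    ys = ys.float() / H - 0.5
    keep = mask > 0.5                            # only keep pixels inside the person mask
    front = torch.stack([xs[keep], ys[keep], front_depth[keep]], dim=-1)
    back = torch.stack([xs[keep], ys[keep], back_depth[keep]], dim=-1)
    return torch.cat([front, back], dim=0)

H, W = 64, 48
pts = depths_to_points(torch.full((H, W), 0.4), torch.full((H, W), 0.6), torch.ones(H, W))
print(pts.shape)   # torch.Size([6144, 3]) -- front and back sheets of H*W points each
```
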
XFormer: Fast and Accurate Monocular 3D Body Capture
We present XFormer, a novel human mesh and motion capture method that
achieves real-time performance on consumer CPUs given only monocular images as
input. The proposed network architecture contains two branches: a keypoint
branch that estimates 3D human mesh vertices given 2D keypoints, and an image
branch that makes predictions directly from the RGB image features. At the core
of our method is a cross-modal transformer block that allows information to
flow across these two branches by modeling the attention between 2D keypoint
coordinates and image spatial features. Our architecture is carefully designed
so that it can be trained on various types of datasets, including images with
2D/3D annotations, images with 3D pseudo labels, and motion capture datasets
that do not have associated images. This effectively improves the accuracy and
generalization ability of our system. Built on a lightweight backbone
(MobileNetV3), our method runs blazingly fast (over 30 fps on a single CPU core)
and still yields competitive accuracy. Furthermore, with an HRNet backbone,
XFormer delivers state-of-the-art performance on the Human3.6M and 3DPW datasets.
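To illustrate what a cross-modal transformer block between keypoint tokens and image feature tokens can look like, here is a minimal sketch in which each branch queries the other with standard multi-head attention. Token counts, embedding size, and the fusion/normalization scheme are assumptions, not the XFormer implementation.

```python
# Illustrative sketch of a cross-modal transformer block (token counts, dims,
# and the fusion scheme are assumptions, not the XFormer implementation): the
# keypoint branch queries image tokens and vice versa with multi-head attention.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.kp_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_from_kp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_kp = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, kp_tokens, img_tokens):
        # kp_tokens: (B, J, dim) embedded 2D keypoints; img_tokens: (B, HW, dim) image features.
        kp = self.norm_kp(kp_tokens + self.kp_from_img(kp_tokens, img_tokens, img_tokens)[0])
        img = self.norm_img(img_tokens + self.img_from_kp(img_tokens, kp, kp)[0])
        return kp, img

kp = torch.randn(1, 24, 64)          # 24 joints embedded to 64-d tokens
img = torch.randn(1, 7 * 7, 64)      # a 7x7 feature map flattened into tokens
out_kp, out_img = CrossModalBlock(dim=64)(kp, img)
print(out_kp.shape, out_img.shape)   # torch.Size([1, 24, 64]) torch.Size([1, 49, 64])
```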